Diptera wing classification using Topological Data Analysis
Authors
Affiliation
Guilherme Vituri F. Pinto
Universidade Estadual Paulista
Sergio Ura
Northon
Published
February 24, 2026
Abstract
We apply tools from Topological Data Analysis (TDA) to classify Diptera families based on wing venation patterns. Using multiple filtration strategies — Vietoris-Rips on point clouds, directional height filtrations (8 directions), radial filtrations, Euclidean Distance Transform filtrations, and grayscale sublevel-set (cubical) filtrations on wing images — we extract both H0 and H1 topological features (persistence images, Betti curves, persistence landscapes and summary statistics) and compare distance-based and feature-based classifiers via leave-one-out cross-validation. Feature selection via Random Forest importance and nested LOOCV provide honest, unbiased accuracy estimates.
Keywords
Topological Data Analysis, Persistent homology, Diptera classification, Wing venation
1 Introduction
The order Diptera (true flies) comprises over 150,000 described species across more than 150 families. Wing venation patterns are a classical diagnostic character in Diptera systematics: the arrangement, branching and connectivity of veins varies markedly across families and provides a natural morphological signature.
In this work, we apply Topological Data Analysis (TDA) to the problem of classifying Diptera families from wing images. TDA provides a framework for extracting shape descriptors that are robust to continuous deformations — exactly the kind of invariance desirable when comparing biological structures that vary in scale, orientation and minor deformations across individuals.
We employ five complementary filtration strategies:
Vietoris-Rips filtration on point-cloud samples of wing silhouettes
Directional height filtrations (8 directions) that sweep across the wing along different axes
Radial filtration from the wing centroid to the periphery
Euclidean Distance Transform (EDT) filtration that encodes vein thickness
Grayscale sublevel-set (cubical) filtration on the raw wing image
For each filtration, we compute both H0 (connected components / vein branching) and H1 (loops / enclosed cells) persistence, then vectorize into feature representations (persistence images, Betti curves, persistence landscapes) and feed into classifiers. Feature selection and nested cross-validation provide honest accuracy estimates.
2 Methods
2.1 Data loading and preprocessing
All images are in the images/processed directory. For each image, we load it, apply a Gaussian blur (to close small gaps in the wing membrane and keep it connected), crop to the bounding box, and resize to 150 pixels of height.
In [3]:
all_paths = readdir("images/processed", join = true)
all_filenames = basename.(all_paths) .|> (x -> replace(x, ".png" => ""))

function extract_family(name)
    family_raw = lowercase(split(name, r"[\s\-]")[1])
    if family_raw in ("bibionidae", "biobionidae")
        return "Bibionidae"
    elseif family_raw in ("sciaridae", "scaridae")
        return "Sciaridae"
    elseif family_raw == "simulidae"
        return "Simuliidae"
    else
        return titlecase(family_raw)
    end
end

function canonical_id(name)
    family = extract_family(name)
    parts = split(name, r"[\s\-]")
    number = parts[end]
    "$(family)-$(number)"
end

# Deduplicate (space vs hyphen variants of the same file)
seen = Set{String}()
keep_idx = Int[]
for (i, fname) in enumerate(all_filenames)
    cid = canonical_id(fname)
    if !(cid in seen)
        push!(seen, cid)
        push!(keep_idx, i)
    end
end

paths = all_paths[keep_idx]
species = all_filenames[keep_idx]
families = extract_family.(species)
individuals = map(species) do specie
    parts = split(specie, r"[\s\-]")
    string(extract_family(specie)[1]) * "-" * parts[end]
end

println("Total images after deduplication: $(length(paths))")
println("Families: ", sort(unique(families)))
println("\nSamples per family:")
for f in sort(unique(families))
    println("  $(f): $(count(==(f), families))")
end
Families with fewer than 3 samples (e.g. Pelecorhynchidae with \(n=2\)) can distort cross-validation results—a single misclassification changes accuracy by 50%. We provide a filtered version and run the analysis both ways.
In [4]:
MIN_FAMILY_SIZE = 3
family_counts = Dict(f => count(==(f), families) for f in unique(families))
small_families = [f for (f, c) in family_counts if c < MIN_FAMILY_SIZE]

if !isempty(small_families)
    println("Families with < $MIN_FAMILY_SIZE samples (excluded from filtered analysis):")
    for f in sort(small_families)
        println("  $(f): $(family_counts[f]) samples")
    end
end

# Build filtered indices
keep_filtered = [i for i in eachindex(families) if family_counts[families[i]] >= MIN_FAMILY_SIZE]
paths_filtered = paths[keep_filtered]
species_filtered = species[keep_filtered]
families_filtered = families[keep_filtered]
individuals_filtered = individuals[keep_filtered]

println("\nFiltered dataset: $(length(keep_filtered)) samples, $(length(unique(families_filtered))) families")
Families with < 3 samples (excluded from filtered analysis):
Pelecorhynchidae: 2 samples
Filtered dataset: 70 samples, 9 families
The chunk below selects 5 wings (prioritizing those with the largest number of disconnected components before correction), then compares the binary pixel set before and after connect_pixel_components.
We now compute persistent homology using five filtration strategies. For the Vietoris-Rips filtration on connected point clouds, H0 is uninformative (single infinite bar), so we use only H1. However, for cubical filtrations (directional, radial, EDT, grayscale), H0 is highly informative — it captures when disconnected vein segments merge as the filtration parameter grows, directly encoding vein count and branching patterns. We therefore compute both H0 and H1 for all cubical-based filtrations.
What is persistent homology?
Persistent homology is the main tool of TDA. Given a shape or dataset, it tracks how topological features — connected components (dimension 0), loops (dimension 1), voids (dimension 2), etc. — appear and disappear as we “grow” the shape through a filtration parameter. Each feature has a birth time (when it appears) and a death time (when it gets filled in). The collection of all (birth, death) pairs is called a persistence diagram. Features with long lifetimes (high persistence = death \(-\) birth) represent genuine topological structure, while short-lived features are typically noise.
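The birth/death bookkeeping above can be made concrete in the simplest case: 0-dimensional sublevel-set persistence of a function on a path graph. The sketch below is not the notebook's code; it is a minimal union-find implementation of the elder rule (a component is born at a local minimum and dies when it merges into an older, lower-birth component), with zero-persistence pairs dropped.

```julia
# Minimal sketch: H0 sublevel-set persistence on a path graph via union-find.
function sublevel_h0(f::Vector{Float64})
    n = length(f)
    parent = collect(1:n)
    birth = copy(f)                       # birth value of each component root
    find(i) = parent[i] == i ? i : (parent[i] = find(parent[i]))
    pairs = Tuple{Float64, Float64}[]
    for i in sortperm(f)                  # add vertices in increasing order of f
        for j in (i - 1, i + 1)           # neighbors on the path graph
            (1 <= j <= n && f[j] <= f[i]) || continue
            ri, rj = find(i), find(j)
            ri == rj && continue
            # elder rule: the younger component (larger birth) dies at f[i]
            young, old = birth[ri] > birth[rj] ? (ri, rj) : (rj, ri)
            birth[young] < f[i] && push!(pairs, (birth[young], f[i]))
            parent[young] = old
        end
    end
    push!(pairs, (minimum(f), Inf))       # one essential component survives
    sort(pairs)
end

diagram = sublevel_h0([0.0, 2.0, 1.0, 3.0, 0.5, 2.5])
# local minima at 1.0 and 0.5 each start a component that later merges
# into the global minimum's component: [(0.0, Inf), (0.5, 3.0), (1.0, 2.0)]
```

The same elder-rule logic underlies H0 persistence on the cubical filtrations used later, just with pixel adjacency instead of path-graph adjacency.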
3.1 Strategy 1: Vietoris-Rips filtration on point clouds
Vietoris-Rips filtration
Given a set of points in \(\mathbb{R}^n\), the Vietoris-Rips complex at scale \(\varepsilon\) connects any subset of points that are pairwise within distance \(\varepsilon\). As \(\varepsilon\) increases from 0, we obtain a nested sequence of simplicial complexes — the Rips filtration. This is the most common filtration in TDA for point-cloud data. It is computationally expensive (since it must consider all pairwise distances), which is why we subsample the point clouds.
We sample 750 points from each wing silhouette using farthest-point sampling (which ensures good coverage of the shape), then compute 1-dimensional Rips persistence:
In [9]:
samples = Vector{Any}(undef, length(Xs))
Threads.@threads for i in eachindex(Xs)
    samples[i] = farthest_points_sample(Xs[i], 750)
end
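For reference, a minimal sketch of greedy farthest-point sampling (the notebook's farthest_points_sample is assumed to behave along these lines): repeatedly pick the point that maximizes the distance to the already-chosen subset, which spreads samples evenly over the shape.

```julia
# Hypothetical re-implementation sketch of farthest-point sampling.
function farthest_points_sample_sketch(X::Vector{Vector{Float64}}, k::Int)
    n = length(X)
    k = min(k, n)
    chosen = [1]                                       # start from an arbitrary point
    dmin = [sqrt(sum(abs2, x - X[1])) for x in X]      # distance to chosen set
    for _ in 2:k
        i = argmax(dmin)                               # farthest remaining point
        push!(chosen, i)
        for j in 1:n                                   # update distances to chosen set
            d = sqrt(sum(abs2, X[j] - X[i]))
            d < dmin[j] && (dmin[j] = d)
        end
    end
    X[chosen]
end

square = [[0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [1.0, 1.0], [0.5, 0.5]]
sample = farthest_points_sample_sketch(square, 4)
# picks the four corners before the center point
```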
In [10]:
pds_rips = @showprogress map(samples) do s
    rips_pd_1d(s, cutoff = 5, threshold = 200)
end;
A height filtration sweeps a hyperplane across the shape in a chosen direction and tracks topology as the “visible” region grows. For a direction vector \(v\), we assign each foreground pixel the value \(\langle (i,j), v \rangle\) (its projection onto \(v\)), then compute sublevel-set persistence. Different directions capture different geometric aspects: a horizontal sweep detects how vein loops are arranged from base to tip, a vertical sweep captures dorsal-ventral structure, and diagonal sweeps capture oblique patterns. Using multiple directions enriches the topological signature.
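The filtration values themselves are easy to write down. A minimal sketch (background pixels get +Inf so they never enter the sublevel sets; the notebook's directional_pd_* helpers presumably then run cubical persistence on such an array):

```julia
# Height-filtration values: each foreground pixel (i, j) gets ⟨(i, j), v⟩.
function height_filtration(A::AbstractMatrix{Bool}, v::Vector{Float64})
    F = fill(Inf, size(A))
    for idx in CartesianIndices(A)
        if A[idx]
            F[idx] = idx[1] * v[1] + idx[2] * v[2]
        end
    end
    F
end

A = Bool[1 0; 1 1]
F = height_filtration(A, [0.0, 1.0])   # sweep along columns: value = column index
```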
We compute persistence along eight directions (every 22.5°) to capture finer angular structure of vein branching, including oblique vein angles missed by 4 directions. For each direction, we extract both H0 (connected component merging = vein branching) and H1 (loop formation):
In [12]:
angles = range(0, π, length = 9)[1:8]
directions = [[sin(θ), cos(θ)] for θ in angles]
direction_names = ["Dir_$(round(Int, rad2deg(θ)))°" for θ in angles]

println("Using $(length(directions)) directions:")
for (name, dir) in zip(direction_names, directions)
    println("  $name: $dir")
end

# H1 persistence (loops) — as before, but expanded to 8 directions
pds_directional = Dict{String, Vector}()
for (dir, name) in zip(directions, direction_names)
    pds_directional[name] = @showprogress "$name H1" map(wing_arrays) do A
        directional_pd_1d(A, dir)
    end
end

# H0 persistence (connected component merging = vein branching patterns)
pds_directional_h0 = Dict{String, Vector}()
for (dir, name) in zip(directions, direction_names)
    pds_directional_h0[name] = @showprogress "$name H0" map(wing_arrays) do A
        directional_pd_0d(A, dir)
    end
end;
The radial filtration assigns each foreground pixel a value equal to its distance from the centroid of the wing. Sublevel-set persistence on this function captures how topological features (loops in the venation) are distributed from the center of the wing outward. This is complementary to the directional filtrations.
In [13]:
pds_radial = @showprogress "radial_pd_1d" map(wing_arrays) do A
    radial_pd_1d(A)
end;
The Euclidean Distance Transform assigns each foreground pixel the distance to the nearest background pixel. Thick veins get high EDT values. By negating the EDT as a filtration value, thick veins appear first in the sublevel-set filtration. This captures the vein thickness hierarchy — a diagnostic taxonomic character (e.g., Tabanidae have thickened costal and subcostal veins).
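A brute-force sketch of the negated-EDT filtration values (quadratic in image size; the notebook's edt_pd_* helpers are assumed to build on a fast distance transform instead):

```julia
# Negated EDT: foreground pixels get minus the distance to the nearest
# background pixel, so thick-vein interiors enter the sublevel sets first.
function neg_edt(A::AbstractMatrix{Bool})
    bg = [idx for idx in CartesianIndices(A) if !A[idx]]
    F = fill(Inf, size(A))
    for idx in CartesianIndices(A)
        A[idx] || continue
        F[idx] = -minimum(sqrt(sum(abs2, Tuple(idx - b))) for b in bg)
    end
    F
end

A = Bool[0 0 0 0 0;
         0 1 1 1 0;
         0 1 1 1 0;
         0 1 1 1 0;
         0 0 0 0 0]
F = neg_edt(A)
# the center pixel of the 3×3 blob is farthest from the background
```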
In [15]:
pds_edt_h1 = @showprogress "EDT H1" map(wing_arrays) do A
    edt_pd_1d(A)
end

pds_edt_h0 = @showprogress "EDT H0" map(wing_arrays) do A
    edt_pd_0d(A)
end;
The function cubical_pd computes sublevel-set persistence directly on the grayscale wing image (inverted so that dark veins have low filtration values). This captures the intensity landscape of the wing image without any thresholding, preserving information about semi-transparent wing membrane regions and vein intensity gradients.
In [16]:
pds_cubical = @showprogress "Cubical" map(wing_arrays) do A
    cubical_pd(A; dim_max = 1)
end

pds_cubical_h0 = [pd[1] for pd in pds_cubical]
pds_cubical_h1 = [pd[2] for pd in pds_cubical];
Raw persistence diagrams live in a space that is not directly amenable to standard machine learning. We vectorize them using three approaches:
Persistence images
A persistence image is a stable, finite-dimensional representation of a persistence diagram. Each point \((b, d)\) is mapped to \((b, d - b)\) coordinates (birth vs persistence), weighted by a function that emphasizes long-lived features, then smoothed with a Gaussian kernel and discretized onto a grid. The result is a matrix (image) that can be treated as a feature vector. Persistence images are stable with respect to the Wasserstein distance and have proven effective in machine learning pipelines.
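A minimal sketch of the construction, under simplifying assumptions (linear persistence weighting, an isotropic Gaussian, a fixed grid over the (birth, persistence) plane; the notebook's actual vectorizer may differ in weighting and grid choice):

```julia
# Persistence image sketch: Gaussian bump per diagram point, weighted by persistence.
function persistence_image(diagram::Vector{Tuple{Float64, Float64}};
                           res = 10, sigma = 0.1, lims = (0.0, 1.0))
    grid = range(lims[1], lims[2], length = res)
    img = zeros(res, res)
    for (b, d) in diagram
        isfinite(d) || continue
        p = d - b                        # persistence
        for (i, x) in enumerate(grid), (j, y) in enumerate(grid)
            # bump centered at (birth, persistence), weighted by persistence
            img[i, j] += p * exp(-((x - b)^2 + (y - p)^2) / (2 * sigma^2))
        end
    end
    img
end

img = persistence_image([(0.1, 0.9), (0.2, 0.3)])
# the long-lived point (persistence 0.8) dominates the image
```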
Betti curves
The Betti curve \(\beta_k(t)\) counts the number of \(k\)-dimensional features alive at filtration value \(t\). For dimension 1, it counts the number of loops present at each scale. Discretized over a grid, it produces a feature vector. Betti curves are simple, interpretable, and capture the “topological complexity” of the shape at each scale.
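The definition translates directly into a one-liner (a sketch, not the notebook's vectorizer): count, at each grid value \(t\), the intervals with birth \(\le t <\) death.

```julia
# Betti curve: number of intervals alive at each grid point.
betti_curve(diagram, grid) = [count(((b, d),) -> b <= t < d, diagram) for t in grid]

diagram = [(0.0, 2.0), (0.5, 1.5), (1.0, 3.0)]
curve = betti_curve(diagram, 0.0:0.5:3.0)
# → [1, 2, 3, 2, 1, 1, 0]
```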
Persistence landscapes
A persistence landscape is a sequence of piecewise-linear functions derived from a persistence diagram. The \(k\)-th landscape \(\lambda_k\) is the \(k\)-th largest value of a collection of tent functions, one per interval. Landscapes live in a Banach space, which means we can compute means, perform hypothesis tests, and use them directly in statistical and machine learning methods. They provide a richer representation than Betti curves.
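A sketch of the landscape evaluated on a grid (each interval contributes the tent function \(\Lambda(t) = \max(0, \min(t - b, d - t))\), and \(\lambda_k(t)\) is the \(k\)-th largest tent value at \(t\)):

```julia
# Persistence landscape sketch: k-th largest tent value at each grid point.
tent(b, d, t) = max(0.0, min(t - b, d - t))

function landscape(diagram, k, grid)
    [begin
         vals = sort([tent(b, d, t) for (b, d) in diagram], rev = true)
         k <= length(vals) ? vals[k] : 0.0
     end for t in grid]
end

diagram = [(0.0, 2.0), (1.0, 3.0)]
λ1 = landscape(diagram, 1, 0.0:0.5:3.0)
λ2 = landscape(diagram, 2, 0.0:0.5:3.0)
# λ2 is nonzero only where the two tents overlap (between t = 1 and t = 2)
```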
Below are example persistence diagrams from each filtration strategy (H₁ from Rips, directional, radial and EDT, plus a directional H₀ example and the sampled point cloud) for one specimen per family:
In [19]:
example_indices = [findfirst(==(f), families) for f in sort(unique(families))]

for i in example_indices
    pers_rips = persistence.(pds_rips[i])
    pers_dir = persistence.(pds_directional[direction_names[1]][i])
    pers_rad = persistence.(pds_radial[i])
    pers_edt = persistence.(pds_edt_h1[i])

    p1 = isempty(pers_rips) ? plot(title = "Rips H₁ (empty)") :
        bar(sort(pers_rips, rev = true), title = "Rips H₁", legend = false, ylabel = "persistence")
    p2 = isempty(pers_dir) ? plot(title = "Dir 0° H₁ (empty)") :
        bar(sort(pers_dir, rev = true), title = "Dir 0° H₁", legend = false, ylabel = "persistence")
    p3 = isempty(pers_rad) ? plot(title = "Radial H₁ (empty)") :
        bar(sort(pers_rad, rev = true), title = "Radial H₁", legend = false, ylabel = "persistence")
    p4 = isempty(pers_edt) ? plot(title = "EDT H₁ (empty)") :
        bar(sort(pers_edt, rev = true), title = "EDT H₁", legend = false, ylabel = "persistence")
    p5 = scatter(last.(samples[i]), first.(samples[i]), aspect_ratio = :equal,
        markersize = 1, legend = false, title = "Point cloud")

    # H0 examples
    pers_dir_h0 = [persistence(x) for x in pds_directional_h0[direction_names[1]][i] if isfinite(persistence(x))]
    p6 = isempty(pers_dir_h0) ? plot(title = "Dir 0° H₀ (empty)") :
        bar(sort(pers_dir_h0, rev = true), title = "Dir 0° H₀", legend = false, ylabel = "persistence")

    p = plot(p1, p2, p3, p4, p5, p6, layout = (3, 2), size = (900, 900),
        plot_title = "$(families[i]) ($(individuals[i]))")
    display(p)
end;
3.8 Summary statistics
We extract summary statistics from each persistence diagram:
In [20]:
stat_names = ["count", "max_pers", "total_pers", "total_pers2",
    "q10", "q25", "median", "q75", "q90", "entropy", "std_pers"]

stats_rips = collect(hcat([pd_statistics(pd) for pd in pds_rips]...)')

# Directional H1 stats
stats_directional = Dict{String, Matrix}()
for name in direction_names
    stats_directional[name] = collect(hcat([pd_statistics(pd) for pd in pds_directional[name]]...)')
end

# Directional H0 stats (NEW)
stats_directional_h0 = Dict{String, Matrix}()
for name in direction_names
    stats_directional_h0[name] = collect(hcat([pd_statistics(pd) for pd in pds_directional_h0[name]]...)')
end

stats_radial = collect(hcat([pd_statistics(pd) for pd in pds_radial]...)')
stats_radial_h0 = collect(hcat([pd_statistics(pd) for pd in pds_radial_h0]...)')  # NEW

# EDT stats (NEW)
stats_edt_h1 = collect(hcat([pd_statistics(pd) for pd in pds_edt_h1]...)')
stats_edt_h0 = collect(hcat([pd_statistics(pd) for pd in pds_edt_h0]...)')

# Cubical stats (NEW)
stats_cubical_h1 = collect(hcat([pd_statistics(pd) for pd in pds_cubical_h1]...)')
stats_cubical_h0 = collect(hcat([pd_statistics(pd) for pd in pds_cubical_h0]...)')

# Combined stats from ALL filtrations (H1 only, for backward compatibility)
stats_all_h1 = hcat(stats_rips, stats_radial,
    [stats_directional[name] for name in direction_names]...,
    stats_edt_h1, stats_cubical_h1)

# Combined stats from ALL filtrations (H0 + H1 = comprehensive)
stats_all = hcat(
    stats_rips,
    stats_radial, stats_radial_h0,
    [stats_directional[name] for name in direction_names]...,
    [stats_directional_h0[name] for name in direction_names]...,
    stats_edt_h1, stats_edt_h0,
    stats_cubical_h1, stats_cubical_h0,
)

println("Statistics dimensions:")
println("  Rips H1: $(size(stats_rips))")
println("  All H1 only: $(size(stats_all_h1))")
println("  All H0+H1 (comprehensive): $(size(stats_all))")
Statistics dimensions:
Rips H1: (72, 11)
All H1 only: (72, 132)
All H0+H1 (comprehensive): (72, 253)
The Wasserstein distance \(W_q\) between two persistence diagrams is the cost of the optimal matching between their points (including matching points to the diagonal, which represents trivial features). With \(q=1\) it equals the Earth Mover’s Distance; with \(q=2\) it penalizes large mismatches more. The Bottleneck distance \(d_B\) is the \(\ell^\infty\) version: it measures the worst single mismatch in the optimal pairing. These distances are metrics on the space of persistence diagrams and are stable with respect to perturbations of the input data.
We compute multiple distance metrics between the persistence diagrams from each filtration:
In [25]:
labels = families

# Rips-based distances (Rips PDs have ~20 intervals, so Wasserstein/Bottleneck are feasible)
D_pi_rips = pairwise_distance([vec(v) for v in pi_rips])
D_betti_rips = pairwise_distance(betti_rips, euclidean)
D_land_rips = pairwise_distance(land1_rips_vecs, euclidean)
D_wass1_rips = wasserstein_distance_matrix(pds_rips, q = 1)
D_wass2_rips = wasserstein_distance_matrix(pds_rips, q = 2)  # Phase 5: Wasserstein-2
D_bott_rips = bottleneck_distance_matrix(pds_rips)

# Directional/radial PDs have hundreds of intervals, so Wasserstein/Bottleneck
# would be prohibitively slow on 72×72 pairs. We use only vectorized distances
# (persistence images, Betti curves, landscapes), which are Euclidean distances
# on fixed-size feature vectors and compute instantly.

# Directional H1 distances (combine all directions via sum of per-direction distances)
D_pi_dir = sum(pairwise_distance([vec(v) for v in pi_directional[name]]) for name in direction_names)
D_betti_dir = sum(pairwise_distance(betti_directional[name], euclidean) for name in direction_names)

# Directional H0 distances (NEW)
D_pi_dir_h0 = sum(pairwise_distance([vec(v) for v in pi_directional_h0[name]]) for name in direction_names)
D_betti_dir_h0 = sum(pairwise_distance(betti_directional_h0[name], euclidean) for name in direction_names)

# Radial distances
D_pi_rad = pairwise_distance([vec(v) for v in pi_radial])
D_betti_rad = pairwise_distance(betti_radial, euclidean)

# Radial H0 distances (NEW)
D_pi_rad_h0 = pairwise_distance([vec(v) for v in pi_radial_h0])
D_betti_rad_h0 = pairwise_distance(betti_radial_h0, euclidean)

# EDT distances (NEW)
D_pi_edt_h1 = pairwise_distance([vec(v) for v in pi_edt_h1])
D_betti_edt_h1 = pairwise_distance(betti_edt_h1, euclidean)
D_pi_edt_h0 = pairwise_distance([vec(v) for v in pi_edt_h0])
D_betti_edt_h0 = pairwise_distance(betti_edt_h0, euclidean)

# Cubical distances (NEW)
D_pi_cub_h1 = pairwise_distance([vec(v) for v in pi_cubical_h1])
D_betti_cub_h1 = pairwise_distance(betti_cubical_h1, euclidean)
D_pi_cub_h0 = pairwise_distance([vec(v) for v in pi_cubical_h0])
D_betti_cub_h0 = pairwise_distance(betti_cubical_h0, euclidean)

distances = Dict(
    # Rips
    "Rips PI" => D_pi_rips,
    "Rips Bottleneck" => D_bott_rips,
    "Rips Wass-1" => D_wass1_rips,
    "Rips Wass-2" => D_wass2_rips,
    "Rips Betti" => D_betti_rips,
    "Rips Landscape" => D_land_rips,
    # Directional H1
    "Directional H1 PI" => D_pi_dir,
    "Directional H1 Betti" => D_betti_dir,
    # Directional H0 (NEW)
    "Directional H0 PI" => D_pi_dir_h0,
    "Directional H0 Betti" => D_betti_dir_h0,
    # Radial
    "Radial H1 PI" => D_pi_rad,
    "Radial H1 Betti" => D_betti_rad,
    "Radial H0 PI" => D_pi_rad_h0,
    "Radial H0 Betti" => D_betti_rad_h0,
    # EDT (NEW)
    "EDT H1 PI" => D_pi_edt_h1,
    "EDT H1 Betti" => D_betti_edt_h1,
    "EDT H0 PI" => D_pi_edt_h0,
    "EDT H0 Betti" => D_betti_edt_h0,
    # Cubical (NEW)
    "Cubical H1 PI" => D_pi_cub_h1,
    "Cubical H1 Betti" => D_betti_cub_h1,
    "Cubical H0 PI" => D_pi_cub_h0,
    "Cubical H0 Betti" => D_betti_cub_h0,
);
With only 72 samples, we use leave-one-out cross-validation: for each sample, the classifier is trained on all other samples and tested on the held-out one. The accuracy is the fraction of correctly predicted labels across all 72 folds. LOOCV has low bias (nearly the entire dataset is used for training) and is the standard validation strategy for small datasets.
5.1 Distance-based classifiers: k-NN
k-Nearest Neighbors (k-NN)
Given a precomputed distance matrix, k-NN classifies a query point by majority vote among its \(k\) nearest neighbors. Weighted k-NN weights each neighbor’s vote by \(1/d\) (inverse distance), giving closer neighbors more influence. The nearest centroid classifier assigns the query to the class whose average distance to the query is smallest. These are nonparametric methods that work directly with any distance or dissimilarity measure — making them natural for TDA, where we have principled distances between persistence diagrams.
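The combination of LOOCV with inverse-distance-weighted k-NN on a precomputed distance matrix can be sketched in a few lines (this is an illustrative re-implementation, not the notebook's own helper):

```julia
# LOOCV with 1/d-weighted k-NN on a precomputed distance matrix.
function loocv_weighted_knn(D::Matrix{Float64}, labels::Vector, k::Int)
    n = length(labels)
    correct = 0
    for i in 1:n
        others = setdiff(1:n, i)                      # leave sample i out
        nbrs = sort(others, by = j -> D[i, j])[1:min(k, n - 1)]
        votes = Dict{eltype(labels), Float64}()
        for j in nbrs                                 # inverse-distance voting
            w = 1 / (D[i, j] + eps())
            votes[labels[j]] = get(votes, labels[j], 0.0) + w
        end
        pred = argmax(votes)                          # label with largest vote
        correct += (pred == labels[i])
    end
    correct / n
end

# toy distance matrix: two tight clusters {1,2} and {3,4}
D = [0.0 1.0 9.0 9.0;
     1.0 0.0 9.0 9.0;
     9.0 9.0 0.0 1.0;
     9.0 9.0 1.0 0.0]
acc = loocv_weighted_knn(D, ["A", "A", "B", "B"], 1)
# → 1.0
```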
We construct feature matrices by concatenating the vectorized TDA representations from all filtrations:
In [28]:
# Feature matrices at different levels of richness
X_stats_rips = sanitize_feature_matrix(stats_rips)
X_stats_all_h1 = sanitize_feature_matrix(stats_all_h1)
X_stats_all = sanitize_feature_matrix(stats_all)

X_rips_full = build_feature_matrix(
    stats = stats_rips,
    pi = pi_rips,
    betti = betti_rips,
    landscape = land1_rips_vecs,
) |> sanitize_feature_matrix

# Multi-filtration features (H1 only): combine everything
all_pi_h1 = [vcat(vec(pi_rips[i]), vec(pi_radial[i]),
    [vec(pi_directional[name][i]) for name in direction_names]...,
    vec(pi_edt_h1[i]), vec(pi_cubical_h1[i])) for i in 1:length(families)]
all_betti_h1 = [vcat(betti_rips[i], betti_radial[i],
    [betti_directional[name][i] for name in direction_names]...,
    betti_edt_h1[i], betti_cubical_h1[i]) for i in 1:length(families)]

X_multi_h1 = build_feature_matrix(
    stats = stats_all_h1,
    pi = all_pi_h1,
    betti = all_betti_h1,
) |> sanitize_feature_matrix

# Multi-filtration features (H0 + H1 = comprehensive)
all_pi = [vcat(vec(pi_rips[i]),
    vec(pi_radial[i]), vec(pi_radial_h0[i]),
    [vec(pi_directional[name][i]) for name in direction_names]...,
    [vec(pi_directional_h0[name][i]) for name in direction_names]...,
    vec(pi_edt_h1[i]), vec(pi_edt_h0[i]),
    vec(pi_cubical_h1[i]), vec(pi_cubical_h0[i])) for i in 1:length(families)]
all_betti = [vcat(betti_rips[i],
    betti_radial[i], betti_radial_h0[i],
    [betti_directional[name][i] for name in direction_names]...,
    [betti_directional_h0[name][i] for name in direction_names]...,
    betti_edt_h1[i], betti_edt_h0[i],
    betti_cubical_h1[i], betti_cubical_h0[i]) for i in 1:length(families)]

X_multi = build_feature_matrix(
    stats = stats_all,
    pi = all_pi,
    betti = all_betti,
) |> sanitize_feature_matrix

println("Feature dimensions:")
println("  Rips stats only: $(size(X_stats_rips))")
println("  All H1 stats: $(size(X_stats_all_h1))")
println("  All H0+H1 stats: $(size(X_stats_all))")
println("  Rips full (stats+PI+Betti+Land): $(size(X_rips_full))")
println("  Multi-filtration H1 only: $(size(X_multi_h1))")
println("  Multi-filtration H0+H1 (comprehensive): $(size(X_multi))")
An SVM finds the hyperplane that maximizes the margin between classes. The RBF (Radial Basis Function) kernel maps data into a high-dimensional space where linear separation becomes possible, controlled by a cost parameter \(C\) (penalty for misclassification). For distance matrices, we convert distances to an RBF-like kernel \(K(i,j) = \exp(-D_{ij}^2 / 2\sigma^2)\) and train a linear SVM on the resulting kernel matrix. This is sometimes called an “empirical kernel map.”
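The distance-to-kernel conversion is a one-line matrix transform. A sketch, with \(\sigma\) set to the median off-diagonal distance (a common heuristic; the notebook may choose \(\sigma\) differently):

```julia
using Statistics  # median

# Empirical kernel map: K(i, j) = exp(-D_ij^2 / 2σ^2).
function rbf_kernel_from_distances(D::Matrix{Float64})
    offdiag = [D[i, j] for i in axes(D, 1), j in axes(D, 2) if i != j]
    σ = median(offdiag)
    exp.(-D .^ 2 ./ (2 * σ^2))
end

D = [0.0 1.0 2.0;
     1.0 0.0 1.0;
     2.0 1.0 0.0]
K = rbf_kernel_from_distances(D)
# diagonal entries are exactly 1; larger distances give smaller similarities
```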
LDA finds a linear projection of the feature space that maximizes the ratio of between-class variance to within-class variance. The projected data is then classified with a simple 1-NN rule. LDA is a classical method that works well when classes are approximately Gaussian and the number of features is not too large relative to the number of samples. It provides an interpretable low-dimensional embedding.
In [31]:
lda_results = []
for (feat_name, X) in feature_sets
    r = loocv_lda(X, labels)
    push!(lda_results, (
        method = "LDA",
        features = feat_name,
        n_correct = sum(r.predictions .== labels),
        n_total = length(labels),
        accuracy = r.accuracy,
    ))
end

lda_df = DataFrame(lda_results)
sort!(lda_df, :accuracy, rev = true)
lda_df
6×5 DataFrame
 Row │ method  features                n_correct  n_total  accuracy
     │ String  String                  Int64      Int64    Float64
─────┼──────────────────────────────────────────────────────────────
   1 │ LDA     Multi-filtration H0+H1         62       72  0.861111
   2 │ LDA     All H0+H1 stats                57       72  0.791667
   3 │ LDA     Multi-filtration H1            55       72  0.763889
   4 │ LDA     All H1 stats                   42       72  0.583333
   5 │ LDA     Rips full                      39       72  0.541667
   6 │ LDA     Rips stats                     38       72  0.527778
5.2.4 Random Forest
Random Forest
A Random Forest is an ensemble of decision trees, each trained on a bootstrap sample of the data using a random subset of features. The final prediction is the majority vote across all trees. Random Forests are robust to overfitting, handle high-dimensional features well, and provide built-in feature importance estimates. They are a strong baseline for tabular data classification tasks.
We combine the best topology-aware distance with a statistics-based distance using convex combinations: \[D_{\text{combined}}(\alpha) = \alpha \cdot D_1^* + (1 - \alpha) \cdot D_2^*\] where \(D_1^*\) and \(D_2^*\) are distances normalized to \([0, 1]\).
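A sketch of the convex combination (normalizing each matrix to \([0, 1]\) by its maximum is an assumption here; other normalizations, e.g. by a quantile, would also fit the formula):

```julia
# Convex combination of two distance matrices, each normalized to [0, 1].
normalize01(D) = D ./ maximum(D)

combined_distance(D1, D2, α) = α .* normalize01(D1) .+ (1 - α) .* normalize01(D2)

D1 = [0.0 2.0; 2.0 0.0]
D2 = [0.0 8.0; 8.0 0.0]
Dc = combined_distance(D1, D2, 0.3)
# both inputs normalize to the same matrix here, so Dc = [0 1; 1 0]
```

The grid search below then scans α over 0.0:0.1:1.0 and k over {1, 3, 5}.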
In [33]:
stats_for_distance = zscore_normalize(sanitize_feature_matrix(stats_all))
stats_vectors_norm = [stats_for_distance[i, :] for i in axes(stats_for_distance, 1)]
D_stats = pairwise_distance(stats_vectors_norm, euclidean)

# Try combining best Rips distances with stats distance
grid_rips_w1 = combined_distance_grid_search(D_wass1_rips, D_stats, labels)
grid_rips_w2 = combined_distance_grid_search(D_wass2_rips, D_stats, labels)

println("Top 5 combinations (Rips Wass-1 + Stats):")
for r in grid_rips_w1[1:min(5, end)]
    println("  α=$(round(r.alpha, digits=1)), k=$(r.k): $(r.n_correct)/$(length(labels)) ($(round(r.accuracy * 100, digits=1))%)")
end

println("\nTop 5 combinations (Rips Wass-2 + Stats):")
for r in grid_rips_w2[1:min(5, end)]
    println("  α=$(round(r.alpha, digits=1)), k=$(r.k): $(r.n_correct)/$(length(labels)) ($(round(r.accuracy * 100, digits=1))%)")
end
# Visualize the grid search
alphas = 0.0:0.1:1.0
ks = [1, 3, 5]

acc_grid_w1 = zeros(length(alphas), length(ks))
for r in grid_rips_w1
    i = findfirst(==(r.alpha), alphas)
    j = findfirst(==(r.k), ks)
    if !isnothing(i) && !isnothing(j)
        acc_grid_w1[i, j] = r.accuracy
    end
end

acc_grid_w2 = zeros(length(alphas), length(ks))
for r in grid_rips_w2
    i = findfirst(==(r.alpha), alphas)
    j = findfirst(==(r.k), ks)
    if !isnothing(i) && !isnothing(j)
        acc_grid_w2[i, j] = r.accuracy
    end
end

p1 = heatmap(string.(ks), string.(collect(alphas)), acc_grid_w1,
    xlabel = "k", ylabel = "α (Rips Wass-1 weight)",
    title = "Rips Wass-1 + Stats", color = :Blues, clims = (0.3, 1.0))
p2 = heatmap(string.(ks), string.(collect(alphas)), acc_grid_w2,
    xlabel = "k", ylabel = "α (Rips Wass-2 weight)",
    title = "Rips Wass-2 + Stats", color = :Blues, clims = (0.3, 1.0))
plot(p1, p2, layout = (1, 2), size = (1000, 450))
7 Ensemble classification
Ensemble methods (majority voting)
Ensemble methods combine predictions from multiple classifiers. In majority voting, each classifier casts a vote for its predicted class, and the class with the most votes wins. In weighted voting, each classifier’s vote is weighted by its individual accuracy, giving more influence to better classifiers. Ensembles are more robust than individual classifiers because different methods tend to make different errors.
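Both voting schemes reduce to weighted tallies over per-classifier prediction vectors. A sketch (this assumes string labels; the notebook's own ensemble helper may differ):

```julia
# Accuracy-weighted voting over per-classifier predictions
# (one vector of predicted labels per classifier).
function weighted_vote(predictions::Vector{<:Vector}, weights::Vector{Float64})
    n = length(first(predictions))
    map(1:n) do i
        votes = Dict{String, Float64}()
        for (preds, w) in zip(predictions, weights)
            votes[preds[i]] = get(votes, preds[i], 0.0) + w
        end
        argmax(votes)                 # label with the largest total weight
    end
end

# majority voting = equal weights
majority_vote(predictions) = weighted_vote(predictions, ones(length(predictions)))

preds = [["A", "B"], ["A", "B"], ["B", "B"]]   # three classifiers, two samples
majority_vote(preds)
# → ["A", "B"]
```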
We combine the best classifiers from each method family:
The distance-combination nested result is unstable for this dataset. Instead, we perform an honest nested LOOCV for the strongest family of models (Random Forest on statistics): the outer loop holds out one sample and the inner loop tunes RF hyperparameters using only the training fold.
Nested cross-validation
Standard LOOCV can give optimistically biased estimates when hyperparameters are tuned on the same data. Nested LOOCV adds an inner cross-validation loop: for each held-out test sample, the best hyperparameters are selected using only the training fold. This provides an unbiased estimate of generalization performance.
The Multi-filtration feature matrix X_multi has ~991 features for only 72 samples (a ~14:1 feature-to-sample ratio). In such high-dimensional settings, SVM with RBF kernel can find separating hyperplanes even for random data. Furthermore, selecting the best kernel and cost parameter from many LOOCV runs introduces selection bias: the reported accuracy of the “best” configuration is upward-biased. Nested LOOCV removes this bias by selecting hyperparameters using only the training fold.
We evaluate the Multi-filtration SVM both with and without PCA dimensionality reduction:
In [43]:
# Nested LOOCV for Multi-filtration SVM (no PCA)
nested_svm_multi = nested_loocv_svm(
    X_multi, labels;
    kernels = [LIBSVM.Kernel.RadialBasis, LIBSVM.Kernel.Linear],
    costs = [0.1, 1.0, 10.0, 100.0],
    use_pca = false,
    inner_folds = 5,
    rng_seed = 20260223,
)

println("=== Nested LOOCV: Multi-filtration SVM (no PCA) ===")
n_corr = sum(nested_svm_multi.predictions .== labels)
println("Accuracy: $(n_corr)/$(length(labels)) ($(round(nested_svm_multi.accuracy * 100, digits=1))%)")
println("Balanced accuracy: $(round(nested_svm_multi.balanced_accuracy * 100, digits=1))%")
println("Macro-F1: $(round(nested_svm_multi.macro_f1 * 100, digits=1))%")

ci_svm = wilson_ci(n_corr, length(labels))
println("95% Wilson CI: [$(round(ci_svm.lower * 100, digits=1))%, $(round(ci_svm.upper * 100, digits=1))%]")

# Show which hyperparameters were selected in each fold
param_counts = Dict{String, Int}()
for p in nested_svm_multi.params
    key = "$(p.kernel), C=$(p.cost)"
    param_counts[key] = get(param_counts, key, 0) + 1
end

println("\nSelected hyperparameters across folds:")
for (k, v) in sort(collect(param_counts), by = last, rev = true)
    println("  $k: $v/$(length(labels)) folds")
end
A permutation test assesses whether the classifier’s accuracy is significantly better than chance. We shuffle the labels many times, recompute LOOCV accuracy each time, and measure how often the shuffled accuracy matches or exceeds the observed accuracy. If the observed accuracy is far above the null distribution, we can be confident the features contain genuine discriminative signal — even if the absolute accuracy estimate may be optimistically biased.
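The test skeleton is generic over any scoring function. A sketch, here driven by a toy 1-NN LOOCV scorer on a block-structured distance matrix (the scorer `acc` and the matrix `D` are illustrative stand-ins, not the notebook's pipeline):

```julia
using Random, Statistics

# Permutation test: compare observed accuracy against accuracies on shuffled labels.
function permutation_test(accuracy_fn, labels; n_perm = 200, rng = MersenneTwister(1))
    observed = accuracy_fn(labels)
    null = [accuracy_fn(shuffle(rng, labels)) for _ in 1:n_perm]
    # p-value: how often shuffled labels do at least as well (+1 smoothing)
    pval = (1 + count(>=(observed), null)) / (n_perm + 1)
    (observed = observed, pvalue = pval)
end

# toy 1-NN LOOCV scorer on a block-structured distance matrix
D = [0.0 1.0 9.0 9.0; 1.0 0.0 9.0 9.0; 9.0 9.0 0.0 1.0; 9.0 9.0 1.0 0.0]
function acc(labels)
    n = length(labels)
    mean(labels[argmin([j == i ? Inf : D[i, j] for j in 1:n])] == labels[i] for i in 1:n)
end

result = permutation_test(acc, ["A", "A", "B", "B"])
```

With only four samples the null distribution is coarse; on the real 72-sample dataset the same skeleton yields a much finer p-value resolution.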
With ~991 features (the full multi-filtration matrix) and \(n=72\) samples, overfitting is the main accuracy bottleneck. Selecting the top features by Random Forest impurity importance reduces dimensionality and improves generalization. Critically, feature selection is performed inside each LOOCV fold to avoid data leakage — each fold selects features using only training data.
# Also try on the H0+H1 stats-only feature matrix (lower dimensional)
for top_k in [20, 30, 50]
    r_sel_stats = loocv_rf_with_selection(
        X_stats_all, labels;
        n_trees_select = 500,
        n_trees_classify = 300,
        top_k = top_k,
        balanced = true,
        rng_seed = 20260223,
    )
    n_corr_sel = sum(r_sel_stats.predictions .== labels)
    println("RF + selection on stats (top_k=$top_k): $(n_corr_sel)/$(length(labels)) ($(round(r_sel_stats.accuracy * 100, digits=1))%)")
end
RF + selection on stats (top_k=20): 47/72 (65.3%)
RF + selection on stats (top_k=30): 52/72 (72.2%)
RF + selection on stats (top_k=50): 46/72 (63.9%)
10.4 Nested LOOCV with multi-distance selection
Multi-distance nested LOOCV
With many distance matrices available, we need an honest way to select the best one. The inner loop evaluates all (distance, k) combinations on the training fold; the outer loop provides an unbiased accuracy estimate.
In [49]:
nested_multi_dist = nested_loocv_multi_distance(distances, labels; ks = [1, 3, 5])

n_corr_multi = sum(nested_multi_dist.predictions .== labels)
println("=== Nested LOOCV: Multi-distance selection ===")
println("Accuracy: $(n_corr_multi)/$(length(labels)) ($(round(nested_multi_dist.accuracy * 100, digits=1))%)")
println("Balanced accuracy: $(round(nested_multi_dist.balanced_accuracy * 100, digits=1))%")
println("Macro-F1: $(round(nested_multi_dist.macro_f1 * 100, digits=1))%")

# Show which distances were selected most often
dist_selection_counts = Dict{String, Int}()
for p in nested_multi_dist.params
    dist_selection_counts[p.distance] = get(dist_selection_counts, p.distance, 0) + 1
end

println("\nSelected distances across folds:")
for (d, v) in sort(collect(dist_selection_counts), by = last, rev = true)
    println("  $d: $v/$(length(labels)) folds")
end
We applied multiple TDA filtration strategies to classify Diptera families from wing venation images. Key findings:
Multiple filtrations are complementary: The Vietoris-Rips filtration on point-cloud samples captures the global loop structure of the wing venation. Directional height filtrations (now 8 directions) encode how topological features are spatially distributed along specific axes, the radial filtration captures the center-to-periphery organization, the EDT filtration captures vein thickness hierarchy, and cubical (grayscale sublevel-set) persistence captures intensity landscape information. Together, these views capture different geometric and topological aspects of the wing.
H0 persistence from directional/radial/EDT filtrations is informative: While H0 is uninformative for Rips on a connected point cloud, H0 from cubical filtrations captures vein branching — when disconnected vein segments merge as the sweep progresses. This is directly related to vein count and branching patterns, a key taxonomic character for Diptera families.
8 filtration directions improve coverage: Expanding from 4 to 8 directions (every 22.5°) captures oblique vein angles missed previously, providing finer angular resolution of the venation topology.
EDT filtration captures vein thickness: The Euclidean Distance Transform filtration captures the vein thickness hierarchy, which is a diagnostic character (e.g., Tabanidae have thickened C and Sc veins). This provides complementary information to structural topology.
Feature selection reduces overfitting: Random Forest feature importance-based selection (performed inside each CV fold to avoid leakage) reduces the feature-to-sample ratio dramatically, improving generalization in honest nested evaluations.
Multi-distance nested LOOCV provides honest model selection: With many distance matrices available, the nested multi-distance evaluation selects the best (distance, k) combination in the inner loop and provides an unbiased accuracy estimate in the outer loop.
Wasserstein-2 vs Wasserstein-1: Both Wasserstein distances are effective for comparing persistence diagrams, with W-2 penalizing large mismatches more heavily. The comparison reveals whether fine or coarse topological differences are more discriminative.
Statistical rigor: We report LOOCV accuracy with Wilson confidence intervals, nested LOOCV for unbiased evaluation when hyperparameters are tuned, and permutation tests to verify that observed accuracy is significantly above chance level.
11.1 Limitations
Class imbalance: Tipulidae has 12 samples while Pelecorhynchidae has only 2, which may affect some classifiers. The filtered analysis (excluding families with < 3 samples) provides a fairer comparison.
Image quality and preprocessing parameters (blur, threshold) influence topological features
The non-nested LOOCV results for feature-based classifiers are optimistically biased due to hyperparameter selection on the evaluation data. The honest comparison table should be preferred
With only 72 samples, confidence intervals remain wide regardless of method
11.2 Future work
Extend dataset with more specimens per family, especially underrepresented families
Improve imaging/segmentation quality and reevaluate image-based filtrations with less noise sensitivity
Apply extended persistence or zigzag persistence for richer invariants
XGBoost or gradient boosting classifiers for tabular feature data
Deep learning on persistence images or persistence diagrams directly